A SMALL HARDWARE IMPLEMENTATION OF THE SUBBYTE FUNCTION OF RIJNDAEL

1. Field of the Invention 

The present invention relates to the field of data encryption. The invention relates
particularly to an apparatus and method for a small hardware implementation of the SubByte
function found in the Advanced Encryption Standard (AES) algorithm or Rijndael Block
Cipher, hereinafter AES/Rijndael. The implementation is redesigned to work with both
inverse and normal processing.

2. Discussion of the Related Art

The current state of the art provides for hardware implementations where the inverse
cipher can only partially re-use the circuitry that implements the cipher. For high-speed
networking processors and Smart Card applications, a very small gate size and a high data
rate (accommodating an Optical Carrier Rate of OC-192 and beyond, 9953.28 Mbps with a
payload of 9.6 Gbps) are desirable.

The AES/Rijndael is an iterated block cipher and is described in a proposal written
by Joan Daemen and Vincent Rijmen and published on March 9, 1999. The National
Institute of Standards and Technology (NIST) has approved the AES/Rijndael as a
cryptographic algorithm and published the AES/Rijndael on November 26, 2001 (Publication
197, also known as Federal Information Processing Standard 197 or "FIPS 197", which is
hereby incorporated by reference as if fully set forth herein). In accordance with many
private key encryption/decryption algorithms, including AES/Rijndael,
encryption/decryption is performed in multiple stages, commonly known as iterations or
rounds. Such algorithms lend themselves to a data processing pipeline or pipelined
architecture. In each round, the AES/Rijndael uses the affine transformation and its inverse
along with other transformations to decrypt (decipher) and encrypt (encipher) information.
Encryption converts data to an unintelligible form called ciphertext; decrypting the
ciphertext converts the data back into its original form, called plaintext.

The input and output for the AES/Rijndael algorithm each consist of sequences of
128 bits (each having a value of 0 or 1). These sequences are commonly referred to as
blocks and the number of bits they contain is referred to as their length ("FIPS 197", NIST,
p. 7). The basic unit for processing in the AES/Rijndael algorithm is a byte, a sequence of
eight bits treated as a single entity with the most significant bit (MSB) on the left. Internally, the
AES/Rijndael algorithm's operations are performed on a two-dimensional array of bytes
called the State. The State consists of four rows of bytes, each containing Nb bytes, where
Nb is the block length divided by 32 ("FIPS 197", NIST, p. 9).

At the start of the Cipher and Inverse Cipher (encryption and decryption), the input,
the array of bytes

in0, in1, ... in15

is copied into the State array as illustrated in FIG. 1. The Cipher or Inverse Cipher operations
are then conducted on each byte in this State array, after which its final values are copied to
the output, the array of bytes

out0, out1, ... out15.
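
As a rough illustration of the copy described above, the following Python sketch maps a 16-byte block into a 4 x 4 State and back. It assumes the column-by-column mapping s[r][c] = in[r + 4c] given in FIPS 197 for Nb = 4; the function names are illustrative only and do not appear in the source.

```python
def bytes_to_state(block):
    """Copy a 16-byte input into a 4x4 State, filling it column by column."""
    if len(block) != 16:
        raise ValueError("an AES block is 16 bytes")
    # Assumed mapping (FIPS 197): s[r][c] = in[r + 4c], 0 <= r < 4, 0 <= c < 4
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Copy the State back to the output: out[r + 4c] = s[r][c]."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


block = bytes(range(16))          # in0 .. in15 = 0 .. 15
state = bytes_to_state(block)
print(state[0])                   # first row holds in0, in4, in8, in12
```

The round trip through `state_to_bytes` returns the original block, which is the property the Cipher and Inverse Cipher rely on when copying the final State to the output.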

The addition of two elements in a finite field is achieved by "adding" the coefficients for the
corresponding powers in the polynomials for the two elements. The addition is performed
with the boolean exclusive OR (XOR) operation ("FIPS 197", NIST, p. 10). The binary notation for
adding two bytes is:

\[ \{01010111\} \oplus \{10000011\} = \{11010100\} \tag{1.0} \]
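
A minimal Python check of equation (1.0): addition in the field is a bitwise XOR of the two bytes. The helper name `gf_add` is an assumption made for this sketch.

```python
def gf_add(a, b):
    """Add two field elements: coefficient-wise addition modulo 2, i.e. XOR."""
    return a ^ b


total = gf_add(0b01010111, 0b10000011)
print(format(total, "08b"))  # 11010100
```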

In the polynomial representation, multiplication in GF(2^8) corresponds with the
multiplication of polynomials modulo an irreducible polynomial of degree 8. A polynomial
is irreducible if its only divisors are one and itself. For the AES/Rijndael algorithm, this
irreducible polynomial is ("FIPS 197", NIST, p. 10):

\[ m(x) = x^8 + x^4 + x^3 + x + 1 \tag{1.1} \]
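
The multiplication described above can be sketched in Python with a shift-and-add loop that reduces by m(x) whenever the degree reaches 8. The names `AES_MODULUS` and `gf_mul` are illustrative assumptions; the test value is the worked example from FIPS 197.

```python
AES_MODULUS = 0x11B  # bit pattern of m(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b):
    """Multiply two bytes as polynomials over GF(2), reduced modulo m(x)."""
    result = 0
    while b:
        if b & 1:              # current coefficient of b is 1: add (XOR) a
            result ^= a
        a <<= 1                # multiply a by x
        if a & 0x100:          # degree reached 8: reduce by m(x)
            a ^= AES_MODULUS
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x83)))  # 0xc1
```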

A diagonal matrix with each diagonal element equal to 1 is called an identity matrix.
The n x n identity matrix is denoted I_n:

\[
I_n =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\tag{1.2}
\]

If A and B are n x n matrices, we call each an inverse of the other if

\[ AB = BA = I_n \tag{1.3} \]

A transformation consisting of multiplication by a matrix followed by the addition of
a vector is called an Affine Transformation.

The SubByte() function of AES/Rijndael is a non-linear byte substitution that
operates independently on each byte of the State using a substitution table (S-box). This
S-box, which is invertible, is constructed by composing two transformations:

1. Take the multiplicative inverse in the finite field GF(2^8), described earlier; the
element {00} is mapped to itself.

2. Apply the following affine transformation (over GF(2)):

\[ b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i \tag{1.4} \]
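
The two-step construction above can be sketched in Python as follows. This is an illustration only: it assumes the constant c = {63} from FIPS 197 for the c_i bits of equation (1.4), computes the inverse by raising to the 254th power instead of using a table, and uses helper names (`gf_mul`, `gf_inv`, `affine`, `sub_byte`) that do not appear in the source.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_MODULUS
        b >>= 1
    return result


def gf_inv(a):
    """Step 1: multiplicative inverse, with {00} mapped to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):       # a^254 = a^(-1) for nonzero a in GF(2^8)
        result = gf_mul(result, a)
    return result


def affine(b, c=0x63):
    """Step 2: equation (1.4), with c assumed to be {63} as in FIPS 197."""
    out = 0
    for i in range(8):
        bit = (b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
        bit ^= (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (c >> i)
        out |= (bit & 1) << i
    return out


def sub_byte(x):
    return affine(gf_inv(x))


print(hex(sub_byte(0x53)))  # 0xed, the substitution example in FIPS 197
```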

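In matrix form, the affine transformation element of the S-box can be expressed as
("FIPS 197", NIST, p. 16):

\[
\begin{pmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\oplus
\begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}
\tag{1.5}
\]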



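If this were implemented as the lookup table suggested by the AES/Rijndael
proposal, a 256-entry ROM or multiplexor would be required. To implement the
AES/Rijndael algorithm, 12 instantiations of this table would be required. The inverse of
this matrix can be found as:

\[
\begin{pmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\oplus
\begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\tag{1.6}
\]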



If this were implemented as the lookup table suggested by the AES/Rijndael proposal, a
128-entry, 16-bit word ROM or multiplexor would be required. To implement the AES/Rijndael
algorithm, 12 instantiations of this table would be required.

Thus there is a need for a system and a method of sharing almost all the circuitry for
the affine transformation in order to reduce gate count. To achieve a high data rate and
small gate size, the design must be architected so that the maximum path is not significantly
longer and the gate size is so small that the design can be replicated to promote parallel
processing without greatly increasing the die size. Increasing die size adds more expense
and power consumption, making the product less marketable. The present invention is an
apparatus and a method for decreasing the gate size at the expense of slightly increasing
the maximum path delay. This makes the circuit smaller and thus more attractive for high
data-rate designs.

Each occurrence in the AES/Rijndael of the pair of affine transform and inverse
affine transform is reduced by the present invention to one transform, the Affine-All
transform. In a preferred embodiment, a circuit performs both normal and inverse affine
transformations with very little duplicate logic. In this preferred embodiment, by
implementing the Affine-All transform with a Multiplicative Inverse ROM, the logic is
greatly reduced and the maximum path delay is reduced compared to a multiplexor
implementation while only being slightly greater than for a ROM implementation.

Thus, the preferred embodiment of the present invention employs a read-only
memory (ROM) for the multiplicative inverse and a reduced combinational logic
implementation for the affine transformation. This implementation is very low in gate count
with a comparable maximum delay path.
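
The split between a multiplicative-inverse ROM and a shared affine stage can be modelled in Python as a behavioural sketch. Several things here are assumptions rather than statements from the source: the ROM is generated by brute-force search instead of being read from a figure, the vectors {63} and {05} are the FIPS 197 constants, the decrypt path applies the inverse affine stage before the ROM lookup (the order implied by inverting the S-box), and all names are illustrative.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_MODULUS
        b >>= 1
    return result


def build_inverse_rom():
    """256-entry table of multiplicative inverses; entry 0 stays 0."""
    rom = [0] * 256
    for a in range(1, 256):
        if rom[a]:
            continue                      # already filled as someone's inverse
        for b in range(1, 256):
            if gf_mul(a, b) == 1:
                rom[a], rom[b] = b, a
                break
    return rom


INVERSE_ROM = build_inverse_rom()
PATTERN = {True: "10001111", False: "00100101"}   # affine / inverse affine
VECTOR = {True: 0x63, False: 0x05}                # assumed FIPS 197 constants


def affine_all(byte, encrypt):
    """One shared stage: AND input bits with a rotated pattern, XOR-reduce."""
    p = [int(ch) for ch in PATTERN[encrypt]]      # p[0] is the leftmost bit
    out = 0
    for k in range(8):
        acc = (VECTOR[encrypt] >> k) & 1
        for i in range(8):
            acc ^= ((byte >> i) & 1) & p[(i + 8 - k) % 8]
        out |= acc << k
    return out


def sub_byte(byte, encrypt):
    if encrypt:
        return affine_all(INVERSE_ROM[byte], True)    # ROM, then affine
    return INVERSE_ROM[affine_all(byte, False)]       # inverse affine, then ROM


print(hex(sub_byte(0x53, True)), hex(sub_byte(0xED, False)))  # 0xed 0x53
```

In this model the only difference between the two directions is the selected pattern, the selected vector, and the position of the ROM lookup, which is the sharing the paragraph above describes.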

FIG. 1 illustrates state array input and output ("FIPS 197", NIST, p. 9).

FIG. 2 illustrates a comparison of the prior art ROM and lookup table (multiplexor)
implementations of the SubByte function with the Affine-All implementation of the present
invention.

FIG. 3 illustrates the ROM or lookup table used with the Affine-All transformation of
the present invention.

FIG. 4 illustrates the netlist of the Affine-All combinational logic.

The present invention is based, in part, on the fact that, beginning at the last row, each
row of matrix equations (1.5) and (1.6) is shifted left by one bit from the previous row. In
the present invention, the first row of each matrix is termed the "load pattern". So the "load
pattern" for the affine transform matrix is {10001111} and the "load pattern" for the inverse
affine transform is {00100101}. Note that the number of 0's in each "load pattern" is an odd
number and is an important characteristic in being able to merge the two transformations into
one circuit in the system and method of the present invention.
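
To make the "load pattern" idea concrete, the Python sketch below rebuilds both 8 x 8 matrices from their first rows by rotation and checks, over GF(2), that they satisfy equation (1.3). The function names are assumptions for illustration; row r is taken as the load pattern rotated right by r positions, which is the same as saying each row is the row below it shifted left by one.

```python
def rows_from_load_pattern(pattern):
    """Row r is the load pattern rotated right by r positions."""
    bits = [int(ch) for ch in pattern]
    return [[bits[(j - r) % 8] for j in range(8)] for r in range(8)]


def matmul_gf2(a, b):
    """Matrix product with addition and multiplication taken modulo 2."""
    return [[sum(a[i][k] & b[k][j] for k in range(8)) % 2 for j in range(8)]
            for i in range(8)]


AFFINE = rows_from_load_pattern("10001111")
INV_AFFINE = rows_from_load_pattern("00100101")
IDENTITY = [[int(i == j) for j in range(8)] for i in range(8)]

print(AFFINE[7])                                   # last row: 0 0 0 1 1 1 1 1
print(matmul_gf2(AFFINE, INV_AFFINE) == IDENTITY)  # True
```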

If both affine transformations are implemented as suggested by Daemen and Rijmen
("FIPS 197") using exclusive OR gates, the circuit equations look as follows:
Affine Transform Equations 
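
Written out bit by bit from equation (1.4), and assuming the vector c = {63} given in FIPS 197, the eight affine equations would take the following form. The overbar marks the equations negated by a 1 bit in the vector; this layout is a reconstruction for illustration and may differ from the original listing.

\[
\begin{aligned}
b'_0 &= \overline{b_0 \oplus b_4 \oplus b_5 \oplus b_6 \oplus b_7} \\
b'_1 &= \overline{b_0 \oplus b_1 \oplus b_5 \oplus b_6 \oplus b_7} \\
b'_2 &= b_0 \oplus b_1 \oplus b_2 \oplus b_6 \oplus b_7 \\
b'_3 &= b_0 \oplus b_1 \oplus b_2 \oplus b_3 \oplus b_7 \\
b'_4 &= b_0 \oplus b_1 \oplus b_2 \oplus b_3 \oplus b_4 \\
b'_5 &= \overline{b_1 \oplus b_2 \oplus b_3 \oplus b_4 \oplus b_5} \\
b'_6 &= \overline{b_2 \oplus b_3 \oplus b_4 \oplus b_5 \oplus b_6} \\
b'_7 &= b_3 \oplus b_4 \oplus b_5 \oplus b_6 \oplus b_7
\end{aligned}
\]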
Notice that each equation has an odd number of terms and the same number of terms: five.
The addition of the vector determines the negation of some equations. So the number of
terms in each equation is determined by the "load pattern". The number of negations is
determined by the addition of the vector, which is termed the "load vector".
Inverse Affine Transform Equations
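
Written out in the same way from the load pattern {00100101}, and assuming the inverse vector {05} from FIPS 197, the eight inverse affine equations would take the following form. As with the affine set, this is a reconstruction for illustration and not the original listing.

\[
\begin{aligned}
b'_0 &= \overline{b_2 \oplus b_5 \oplus b_7} \\
b'_1 &= b_0 \oplus b_3 \oplus b_6 \\
b'_2 &= \overline{b_1 \oplus b_4 \oplus b_7} \\
b'_3 &= b_0 \oplus b_2 \oplus b_5 \\
b'_4 &= b_1 \oplus b_3 \oplus b_6 \\
b'_5 &= b_2 \oplus b_4 \oplus b_7 \\
b'_6 &= b_0 \oplus b_3 \oplus b_5 \\
b'_7 &= b_1 \oplus b_4 \oplus b_6
\end{aligned}
\]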
Each equation has an odd number of terms and the same number of terms: three. The
addition of the vector determines the negation of some equations. So the number of terms in
each equation is determined by the "load pattern". The number of negations is determined
by the addition of the vector.

This addition vector can now be used as a "load vector" as well. Looking at the two
sets of equations, it appears that there is no common logic to be merged. If the equations are
rewritten with the "load pattern" included and use the addition of the vector to determine the
negations, a common circuit is revealed. The properties of the exclusive OR are used to
accomplish this, and these properties are:

\[ A \oplus B \oplus C = C \oplus B \oplus A \tag{1.9} \]
\[ A \oplus 0 = A \tag{2.0} \]
\[ A \oplus 1 = \overline{A} \tag{2.1} \]

^^^'^ (2.2) 

In a preferred embodiment, the circuit implementing both the affine and inverse affine
transforms comprises a Multiplicative Inverse ROM, and the logic that represents both
transforms is as follows, with p as the "load pattern" and v as the "load vector". For example,
here is what equation seven of the affine matrix becomes:

\[ b'_7 = [(b_0 \wedge p_1) \oplus (b_1 \wedge p_2) \oplus (b_2 \wedge p_3) \oplus (b_3 \wedge p_4) \oplus (b_4 \wedge p_5) \oplus (b_5 \wedge p_6) \oplus (b_6 \wedge p_7) \oplus (b_7 \wedge p_0)] \oplus v_7 \tag{2.3} \]

The number of instantiations has been cut in half. Because of the 0's produced by the
ANDing of p and b, the equation works for both affine and inverse affine transformations.
Because b XOR'ed with a 1 is always the inverse of b, using v_7 each time negates the
equation where appropriate.
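
The merged equation can be checked against the two separate transforms with a short Python model. It is a sketch under stated assumptions: p[0] is taken as the leftmost bit of the load pattern, output bit k pairs input bit i with p[(i + 8 - k) mod 8] (which reduces to equation (2.3) for k = 7), and the load vectors are the FIPS 197 constants {63} and {05}. The function names are illustrative.

```python
def affine_all(b, pattern, vector):
    """Generic stage: AND each input bit with a pattern bit, XOR, add v_k."""
    p = [int(ch) for ch in pattern]        # p[0] is the leftmost pattern bit
    out = 0
    for k in range(8):
        acc = (vector >> k) & 1            # v_k negates the sum when it is 1
        for i in range(8):
            acc ^= ((b >> i) & 1) & p[(i + 8 - k) % 8]
        out |= acc << k
    return out


def xor_of_bits(b, k, offsets, vector):
    """Reference form: XOR of selected input bits plus the vector bit."""
    bit = (vector >> k) & 1
    for off in offsets:
        bit ^= (b >> ((k + off) % 8)) & 1
    return bit


def affine_direct(b):
    """Equation (1.4) with the assumed constant {63}."""
    return sum(xor_of_bits(b, k, (0, 4, 5, 6, 7), 0x63) << k for k in range(8))


def inv_affine_direct(b):
    """Three-term inverse affine form with the assumed constant {05}."""
    return sum(xor_of_bits(b, k, (2, 5, 7), 0x05) << k for k in range(8))


same_fwd = all(affine_all(x, "10001111", 0x63) == affine_direct(x)
               for x in range(256))
same_inv = all(affine_all(x, "00100101", 0x05) == inv_affine_direct(x)
               for x in range(256))
print(same_fwd, same_inv)  # True True
```

Only the pattern and vector arguments change between the two directions, which is why a single set of AND and XOR gates can serve both.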
Comparisons: 
Using the design suggested by the AES/Rijndael proposal (FIPS 197) implemented in
two ways:

(1) a 128-entry, 16-bit word ROM, and

(2) a 128-entry, 16-bit word lookup table implemented as a multiplexor,

the ROM, Multiplexor and the Affine-All circuit embodiment of the present invention were
synthesized and timed using maximum path analysis. FIG. 2 compares results where sizes
in gates are given as well as sizes in microns for comparison with the ROM implementation.
Net area is not considered because wire load models differ with technologies.

A preferred embodiment of the ROM or lookup table contains the values shown in
FIG. 3, in hexadecimal format.

The netlist of the Affine-All combinational logic of a preferred embodiment is
shown in FIG. 4. The code for an implementation is included as Appendix A.

The present invention is applicable to all systems and devices capable of secure
communications, comprising security networking processors, secure keyboard devices,
magnetic card reader devices, smart card reader devices, and wireless 802.11 devices.

The above-described embodiments are only typical examples, and their modifications
and variations are apparent to those skilled in the art. Various modifications to the
above-described embodiments can be made without departing from the scope of the invention as
embodied in the accompanying claims.

APPENDIX A

The RTL to implement the Affine-All circuit is shown below:

`timescale 10ns/10ns
module aes_affine_all
(
    byteOut,    // output byte
    bytein,     // input byte
    enCrypt     // 1 = encrypt 0 = decrypt
);

// ports
input        enCrypt;
input  [7:0] bytein;
output [7:0] byteOut;

// Logic reduction
wire [4:0] byteOut_int;
wire [0:7] y_inv, y, y_int;

// Each wand (wired-AND) net is driven twice below: once by an input bit and
// once by a load-pattern bit, so it carries the AND of the two.
wand byteOut_7_0, byteOut_7_1, byteOut_7_2, byteOut_7_3, byteOut_7_4, byteOut_7_5,
     byteOut_7_6, byteOut_7_7;
wand byteOut_4_0, byteOut_4_1, byteOut_4_2, byteOut_4_3, byteOut_4_4, byteOut_4_5,
     byteOut_4_6, byteOut_4_7;
wand byteOut_int_4_0, byteOut_int_4_1, byteOut_int_4_2, byteOut_int_4_3, byteOut_int_4_4,
     byteOut_int_4_5, byteOut_int_4_6, byteOut_int_4_7;
wand byteOut_int_3_0, byteOut_int_3_1, byteOut_int_3_2, byteOut_int_3_3, byteOut_int_3_4,
     byteOut_int_3_5, byteOut_int_3_6, byteOut_int_3_7;
wand byteOut_3_0, byteOut_3_1, byteOut_3_2, byteOut_3_3, byteOut_3_4, byteOut_3_5,
     byteOut_3_6, byteOut_3_7;
wand byteOut_int_2_0, byteOut_int_2_1, byteOut_int_2_2, byteOut_int_2_3, byteOut_int_2_4,
     byteOut_int_2_5, byteOut_int_2_6, byteOut_int_2_7;
wand byteOut_int_1_0, byteOut_int_1_1, byteOut_int_1_2, byteOut_int_1_3, byteOut_int_1_4,
     byteOut_int_1_5, byteOut_int_1_6, byteOut_int_1_7;
wand byteOut_int_0_0, byteOut_int_0_1, byteOut_int_0_2, byteOut_int_0_3, byteOut_int_0_4,
     byteOut_int_0_5, byteOut_int_0_6, byteOut_int_0_7;

// Load patterns: inverse affine and affine; y_int selects one of them.
assign y_inv = 8'b00100101;
assign y     = 8'b10001111;
assign y_int = (enCrypt) ? y : y_inv;

// Output bit 7: no negation in either direction
assign byteOut_7_0 = bytein[0];
assign byteOut_7_0 = y_int[1];
assign byteOut_7_1 = bytein[1];
assign byteOut_7_1 = y_int[2];
assign byteOut_7_2 = bytein[2];
assign byteOut_7_2 = y_int[3];
assign byteOut_7_3 = bytein[3];
assign byteOut_7_3 = y_int[4];
assign byteOut_7_4 = bytein[4];
assign byteOut_7_4 = y_int[5];
assign byteOut_7_5 = bytein[5];
assign byteOut_7_5 = y_int[6];
assign byteOut_7_6 = bytein[6];
assign byteOut_7_6 = y_int[7];
assign byteOut_7_7 = bytein[7];
assign byteOut_7_7 = y_int[0];
assign byteOut[7] = byteOut_7_0 ^ byteOut_7_1 ^ byteOut_7_2 ^ byteOut_7_3 ^
                    byteOut_7_4 ^ byteOut_7_5 ^ byteOut_7_6 ^ byteOut_7_7;
// Intermediate bit 4 (drives output bit 6)
assign byteOut_int_4_0 = bytein[0];
assign byteOut_int_4_0 = y_int[2];
assign byteOut_int_4_1 = bytein[1];
assign byteOut_int_4_1 = y_int[3];
assign byteOut_int_4_2 = bytein[2];
assign byteOut_int_4_2 = y_int[4];
assign byteOut_int_4_3 = bytein[3];
assign byteOut_int_4_3 = y_int[5];
assign byteOut_int_4_4 = bytein[4];
assign byteOut_int_4_4 = y_int[6];
assign byteOut_int_4_5 = bytein[5];
assign byteOut_int_4_5 = y_int[7];
assign byteOut_int_4_6 = bytein[6];
assign byteOut_int_4_6 = y_int[0];
assign byteOut_int_4_7 = bytein[7];
assign byteOut_int_4_7 = y_int[1];
assign byteOut_int[4] = byteOut_int_4_0 ^ byteOut_int_4_1 ^ byteOut_int_4_2 ^
                        byteOut_int_4_3 ^ byteOut_int_4_4 ^
                        byteOut_int_4_5 ^ byteOut_int_4_6 ^ byteOut_int_4_7;

// Intermediate bit 3 (drives output bit 5)
assign byteOut_int_3_0 = bytein[0];
assign byteOut_int_3_0 = y_int[3];
assign byteOut_int_3_1 = bytein[1];
assign byteOut_int_3_1 = y_int[4];
assign byteOut_int_3_2 = bytein[2];
assign byteOut_int_3_2 = y_int[5];
assign byteOut_int_3_3 = bytein[3];
assign byteOut_int_3_3 = y_int[6];
assign byteOut_int_3_4 = bytein[4];
assign byteOut_int_3_4 = y_int[7];
assign byteOut_int_3_5 = bytein[5];
assign byteOut_int_3_5 = y_int[0];
assign byteOut_int_3_6 = bytein[6];
assign byteOut_int_3_6 = y_int[1];
assign byteOut_int_3_7 = bytein[7];
assign byteOut_int_3_7 = y_int[2];
assign byteOut_int[3] = byteOut_int_3_0 ^ byteOut_int_3_1 ^ byteOut_int_3_2 ^
                        byteOut_int_3_3 ^ byteOut_int_3_4 ^
                        byteOut_int_3_5 ^ byteOut_int_3_6 ^ byteOut_int_3_7;

// Output bit 4: no negation in either direction
assign byteOut_4_0 = bytein[0];
assign byteOut_4_0 = y_int[4];
assign byteOut_4_1 = bytein[1];
assign byteOut_4_1 = y_int[5];
assign byteOut_4_2 = bytein[2];
assign byteOut_4_2 = y_int[6];
assign byteOut_4_3 = bytein[3];
assign byteOut_4_3 = y_int[7];
assign byteOut_4_4 = bytein[4];
assign byteOut_4_4 = y_int[0];
assign byteOut_4_5 = bytein[5];
assign byteOut_4_5 = y_int[1];
assign byteOut_4_6 = bytein[6];
assign byteOut_4_6 = y_int[2];
assign byteOut_4_7 = bytein[7];
assign byteOut_4_7 = y_int[3];
assign byteOut[4] = byteOut_4_0 ^ byteOut_4_1 ^ byteOut_4_2 ^
                    byteOut_4_3 ^ byteOut_4_4 ^
                    byteOut_4_5 ^ byteOut_4_6 ^ byteOut_4_7;

// Output bit 3: no negation in either direction
assign byteOut_3_0 = bytein[0];
assign byteOut_3_0 = y_int[5];
assign byteOut_3_1 = bytein[1];
assign byteOut_3_1 = y_int[6];
assign byteOut_3_2 = bytein[2];
assign byteOut_3_2 = y_int[7];
assign byteOut_3_3 = bytein[3];
assign byteOut_3_3 = y_int[0];
assign byteOut_3_4 = bytein[4];
assign byteOut_3_4 = y_int[1];
assign byteOut_3_5 = bytein[5];
assign byteOut_3_5 = y_int[2];
assign byteOut_3_6 = bytein[6];
assign byteOut_3_6 = y_int[3];
assign byteOut_3_7 = bytein[7];
assign byteOut_3_7 = y_int[4];
assign byteOut[3] = byteOut_3_0 ^ byteOut_3_1 ^ byteOut_3_2 ^
                    byteOut_3_3 ^ byteOut_3_4 ^
                    byteOut_3_5 ^ byteOut_3_6 ^ byteOut_3_7;

// Intermediate bit 2 (drives output bit 2); each parenthesised pair is an XOR
assign byteOut_int_2_0 = bytein[0];
assign byteOut_int_2_0 = y_int[6];
assign byteOut_int_2_1 = bytein[1];
assign byteOut_int_2_1 = y_int[7];
assign byteOut_int_2_2 = bytein[2];
assign byteOut_int_2_2 = y_int[0];
assign byteOut_int_2_3 = bytein[3];
assign byteOut_int_2_3 = y_int[1];
assign byteOut_int_2_4 = bytein[4];
assign byteOut_int_2_4 = y_int[2];
assign byteOut_int_2_5 = bytein[5];
assign byteOut_int_2_5 = y_int[3];
assign byteOut_int_2_6 = bytein[6];
assign byteOut_int_2_6 = y_int[4];
assign byteOut_int_2_7 = bytein[7];
assign byteOut_int_2_7 = y_int[5];
assign byteOut_int[2] = (~byteOut_int_2_0 & byteOut_int_2_1 | ~byteOut_int_2_1 & byteOut_int_2_0) ^
                        (~byteOut_int_2_2 & byteOut_int_2_3 | ~byteOut_int_2_3 & byteOut_int_2_2) ^
                        (~byteOut_int_2_4 & byteOut_int_2_5 | ~byteOut_int_2_5 & byteOut_int_2_4) ^
                        (~byteOut_int_2_6 & byteOut_int_2_7 | ~byteOut_int_2_7 & byteOut_int_2_6);

// Intermediate bit 1 (drives output bit 1)
assign byteOut_int_1_0 = bytein[0];
assign byteOut_int_1_0 = y_int[7];
assign byteOut_int_1_1 = bytein[1];
assign byteOut_int_1_1 = y_int[0];
assign byteOut_int_1_2 = bytein[2];
assign byteOut_int_1_2 = y_int[1];
assign byteOut_int_1_3 = bytein[3];
assign byteOut_int_1_3 = y_int[2];
assign byteOut_int_1_4 = bytein[4];
assign byteOut_int_1_4 = y_int[3];
assign byteOut_int_1_5 = bytein[5];
assign byteOut_int_1_5 = y_int[4];
assign byteOut_int_1_6 = bytein[6];
assign byteOut_int_1_6 = y_int[5];
assign byteOut_int_1_7 = bytein[7];
assign byteOut_int_1_7 = y_int[6];
assign byteOut_int[1] = byteOut_int_1_0 ^ byteOut_int_1_1 ^ byteOut_int_1_2 ^
                        byteOut_int_1_3 ^ byteOut_int_1_4 ^
                        byteOut_int_1_5 ^ byteOut_int_1_6 ^ byteOut_int_1_7;

// Intermediate bit 0 (drives output bit 0)
assign byteOut_int_0_0 = bytein[0];
assign byteOut_int_0_0 = y_int[0];
assign byteOut_int_0_1 = bytein[1];
assign byteOut_int_0_1 = y_int[1];
assign byteOut_int_0_2 = bytein[2];
assign byteOut_int_0_2 = y_int[2];
assign byteOut_int_0_3 = bytein[3];
assign byteOut_int_0_3 = y_int[3];
assign byteOut_int_0_4 = bytein[4];
assign byteOut_int_0_4 = y_int[4];
assign byteOut_int_0_5 = bytein[5];
assign byteOut_int_0_5 = y_int[5];
assign byteOut_int_0_6 = bytein[6];
assign byteOut_int_0_6 = y_int[6];
assign byteOut_int_0_7 = bytein[7];
assign byteOut_int_0_7 = y_int[7];
assign byteOut_int[0] = byteOut_int_0_0 ^ byteOut_int_0_1 ^ byteOut_int_0_2 ^
                        byteOut_int_0_3 ^ byteOut_int_0_4 ^
                        byteOut_int_0_5 ^ byteOut_int_0_6 ^ byteOut_int_0_7;

// Apply the load vector: negate where the selected vector bit is 1
assign byteOut[6] = (enCrypt) ? ~byteOut_int[4] : byteOut_int[4];
assign byteOut[5] = (enCrypt) ? ~byteOut_int[3] : byteOut_int[3];
assign byteOut[2] = (enCrypt) ? byteOut_int[2] : ~byteOut_int[2];
assign byteOut[1] = (enCrypt) ? ~byteOut_int[1] : byteOut_int[1];
assign byteOut[0] = ~byteOut_int[0];

endmodule